Airbnb is an online vacation rental marketplace serving a community of hosts and travellers. The diagram below shows how Airbnb grew from two individuals who could not pay their rent in 2007 to a company that reached a US$10 billion valuation by 2014. In 2020, Airbnb went public at a valuation of about US$47 billion.
According to Airbnb, it has millions of listings across 100,000 cities in over 220 countries and regions. The data generated provides rich information, including structured data (e.g. price and location) as well as unstructured data (e.g. reviews and listing descriptions). While statistical and analytic tools are available to derive insights from these data, such tools are often subscription-based and require technical knowledge, which may not be available or accessible to everyone. Hence, this project aims to develop a concise, interactive, and user-friendly interface using R Shiny, from which data-based decisions can be made. The R Shiny app will cover exploratory data analysis, confirmatory data analysis, text mining, as well as predictive analysis.
This assignment is a sub-module of our final Shiny-based Visual Analytics Application (Shiny-VAA). In particular, it focuses on text mining using various R packages. The process is shown below:
Our application can be used from the perspectives of both hosts and guests.
Hosts: In 2014, Airbnb launched the Superhost programme to reward hosts with outstanding hospitality. Compared to regular hosts, Superhosts enjoy better earnings, more visibility, and exclusive rewards. To become a Superhost, a host must meet all of the following criteria:
- 4.8 or higher overall rating based on reviews
- Completed at least 10 stays in the past year, or 100 nights over at least 3 completed stays
- Less than 1% cancellation rate, not including extenuating circumstances
- Responds to 90% of new messages within 24 hours
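The criteria above can be sketched as a simple R check. The function and argument names below are illustrative only; they are not part of Airbnb's data or API.

```r
# Hypothetical helper: TRUE only if a host meets all four Superhost criteria.
is_superhost <- function(rating, stays_past_year, nights_past_year,
                         completed_stays, cancellation_rate, response_rate) {
  meets_activity <- stays_past_year >= 10 ||
    (nights_past_year >= 100 && completed_stays >= 3)
  rating >= 4.8 &&
    meets_activity &&
    cancellation_rate < 0.01 &&   # less than 1% cancellations
    response_rate >= 0.90         # responds to 90% of messages within 24h
}

is_superhost(rating = 4.9, stays_past_year = 12, nights_past_year = 40,
             completed_stays = 12, cancellation_rate = 0, response_rate = 0.95)
```

In the dashboard, a check like this could be vectorised over the `host_is_superhost` and review-score columns to compare a host's listing against the thresholds.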
Guests: With over 60,000 members and 6,000 properties listed on the Airbnb website, choosing the right space can be a dilemma for users. Various modules in our dashboard will allow both types of users to analyse Airbnb data according to their needs.
InsideAirbnb provides tools and data for users to explore Airbnb. We will be using the following files:
- listings.csv.gz: This dataset consists of 74 variables and 4256 data points.
- reviews.csv.gz: This dataset provides 6 variables and 52368 data points.
While the team has decided to use the latest set of data, compiled on 27 January 2021, this report uses data compiled on 29 December 2020 for completeness.
Conduct a literature review on how such analyses were performed previously. The focus is on identifying gaps where an interactive web approach and visual analytics techniques can enhance the user experience of these analysis techniques.
Airbnb data has been widely used for text mining in tools like Python and R. In Python, the [Natural Language Toolkit (NLTK)](https://www.nltk.org/) offers easy-to-use interfaces to over 50 corpora and lexical resources, as well as a wide range of text processing libraries for tokenisation, stemming, classification, etc. Similarly, R has extensive libraries such as tidytext and Shiny which allow for text mining and the building of interactive dashboards.
Zhang (2019) used text mining approaches including content analysis and topic modelling (the Latent Dirichlet Allocation (LDA) method) to examine over 1 million Airbnb reviews across 50,933 listings in the United States of America (USA). Kiatkawsin, Sutherland & Kim (2020) also used the LDA method to compare reviews between Hong Kong and Singapore. However, these articles do not provide visualisations of the methods used and are not interactive.
Kim’s Shiny Airbnb App provided an interactive dashboard for Exploratory Data Analysis (EDA), but left out reviews. [Ankit Pandey](https://github.com/ankit2web/Twitter-Sentiment-Analysis-using-R-Shiny-WebApp) provided a more comprehensive text analytics dashboard using word clouds and sentiment polarity, but does not offer much interactivity.
To address these gaps, the next section outlines the steps:
Extract, wrangle and prepare the input data required to perform the analysis. The focus is on exploring appropriate tidyverse methods.
`runtime: shiny` was added to the YAML header to allow dynamic documents. The `{r}` part of a code chunk specifies that its contents are R code, which is evaluated and rendered into the output format. `echo=TRUE` is set so that the code chunk itself is printed when the document is rendered. More details can be found in the R Markdown documentation.
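A minimal header and setup chunk illustrating these settings might look like the following (the title and output format are placeholders, not this post's actual header):

```
---
title: "Airbnb Text Mining"
output: distill::distill_article
runtime: shiny
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```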
To install the packages (where missing) and load the libraries, run the following code chunk:
packages <- c("tidyverse", "sf", "tmap", "crosstalk", "leaflet", "RColorBrewer",
              "ggplot2", "rgdal", "rgeos", "raster", "maptools", "tmaptools",
              "shiny", "tidytext", "wordcloud", "wordcloud2", "tm", "ggthemes",
              "igraph", "ggmap", "DT", "reshape2", "ggraph", "topicmodels",
              "quanteda", "DataExplorer")

# Install any package that is not yet available, then load it.
for (p in packages){
  if (!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Use read_csv() to read each file from its path. It prints a column specification giving the name and type of each column. As there are unnecessary columns, select() is used to retain only the columns needed for subsequent analysis:
- The reviews file contains 52367 observations with 6 variables; 2 columns (listing_id and comments) are retained.
- The listings file contains 4255 observations with 74 variables; 33 columns are retained.
reviews <- read_csv("C:/Users/joeyc/blog/_posts/2021-03-29-assignment/data/reviews.csv") %>%
  dplyr::select(listing_id, comments)
##
## -- Column specification --------------------------------------------------------
## cols(
## listing_id = col_double(),
## id = col_double(),
## date = col_date(format = ""),
## reviewer_id = col_double(),
## reviewer_name = col_character(),
## comments = col_character()
## )
listings <- read_csv("C:/Users/joeyc/blog/_posts/2021-03-29-assignment/data/listings.csv") %>%
  rename(listing_id = id) %>%
  dplyr::select(-c(listing_url, scrape_id, last_scraped, name, picture_url,
                   host_url, host_about, host_thumbnail_url, host_picture_url,
                   host_listings_count, host_verifications, calendar_updated,
                   first_review, last_review, license, neighborhood_overview,
                   description, host_total_listings_count, host_has_profile_pic,
                   availability_30, availability_60, availability_90,
                   availability_365, calculated_host_listings_count,
                   calculated_host_listings_count_entire_homes,
                   calculated_host_listings_count_private_rooms,
                   calculated_host_listings_count_shared_rooms,
                   reviews_per_month, minimum_nights, maximum_nights,
                   minimum_minimum_nights, maximum_minimum_nights,
                   minimum_maximum_nights, maximum_maximum_nights,
                   number_of_reviews_ltm, number_of_reviews_l30d,
                   minimum_nights_avg_ntm, maximum_nights_avg_ntm,
                   calendar_last_scraped, has_availability, instant_bookable))
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## listing_url = col_character(),
## last_scraped = col_date(format = ""),
## name = col_character(),
## description = col_character(),
## neighborhood_overview = col_character(),
## picture_url = col_character(),
## host_url = col_character(),
## host_name = col_character(),
## host_since = col_date(format = ""),
## host_location = col_character(),
## host_about = col_character(),
## host_response_time = col_character(),
## host_response_rate = col_character(),
## host_acceptance_rate = col_character(),
## host_is_superhost = col_logical(),
## host_thumbnail_url = col_character(),
## host_picture_url = col_character(),
## host_neighbourhood = col_character(),
## host_verifications = col_character(),
## host_has_profile_pic = col_logical()
## # ... with 17 more columns
## )
## i Use `spec()` for the full column specifications.
right_join() is used to merge the reviews and listings files so that all rows from listings are returned, including listings that have no reviews.
data <- right_join(reviews,listings,by="listing_id")
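A toy example with made-up IDs illustrates this join behaviour: every row of the right-hand table (listings) is kept, and listings without a matching review get NA for the review columns.

```r
library(dplyr)

toy_reviews  <- tibble(listing_id = c(1, 1, 2),
                       comments   = c("Great stay", "Clean room", "Good host"))
toy_listings <- tibble(listing_id = c(1, 2, 3),
                       price      = c("$80.00", "$120.00", "$95.00"))

# Listing 3 has no reviews but is still returned, with comments = NA.
right_join(toy_reviews, toy_listings, by = "listing_id")
```

This also explains why the merged data has 54,074 rows: each listing appears once per review, plus once for listings with no reviews at all.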
To write the merged data to CSV for future use, uncomment (remove the #) and run the following code:
#write.csv(data,"data.csv")
glimpse(data)
## Rows: 54,074
## Columns: 34
## $ listing_id <dbl> 49091, 50646, 50646, 50646, 50646, 50646,~
## $ comments <chr> "Fran was absolutely gracious and welcomi~
## $ host_id <dbl> 266763, 227796, 227796, 227796, 227796, 2~
## $ host_name <chr> "Francesca", "Sujatha", "Sujatha", "Sujat~
## $ host_since <date> 2010-10-20, 2010-09-08, 2010-09-08, 2010~
## $ host_location <chr> "Singapore", "Singapore, Singapore", "Sin~
## $ host_response_time <chr> "within a few hours", "a few days or more~
## $ host_response_rate <chr> "100%", "0%", "0%", "0%", "0%", "0%", "0%~
## $ host_acceptance_rate <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A",~
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,~
## $ host_neighbourhood <chr> "Woodlands", "Bukit Timah", "Bukit Timah"~
## $ host_identity_verified <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,~
## $ neighbourhood <chr> NA, "Singapore, Singapore", "Singapore, S~
## $ neighbourhood_cleansed <chr> "Woodlands", "Bukit Timah", "Bukit Timah"~
## $ neighbourhood_group_cleansed <chr> "North Region", "Central Region", "Centra~
## $ latitude <dbl> 1.44255, 1.33235, 1.33235, 1.33235, 1.332~
## $ longitude <dbl> 103.7958, 103.7852, 103.7852, 103.7852, 1~
## $ property_type <chr> "Private room in apartment", "Private roo~
## $ room_type <chr> "Private room", "Private room", "Private ~
## $ accommodates <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,~
## $ bathrooms <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ bathrooms_text <chr> "1 bath", "1 bath", "1 bath", "1 bath", "~
## $ bedrooms <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ beds <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ amenities <chr> "[\"Washer\", \"Elevator\", \"Long term s~
## $ price <chr> "$80.00", "$80.00", "$80.00", "$80.00", "~
## $ number_of_reviews <dbl> 1, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18~
## $ review_scores_rating <dbl> 94, 91, 91, 91, 91, 91, 91, 91, 91, 91, 9~
## $ review_scores_accuracy <dbl> 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
## $ review_scores_cleanliness <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1~
## $ review_scores_checkin <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1~
## $ review_scores_communication <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1~
## $ review_scores_location <dbl> 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,~
## $ review_scores_value <dbl> 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,~
glimpse() does not present the data in a tabular format, hence the datatable() and kable() functions were considered. However:
- datatable() does not work well with the FixedColumns, FixedHeader and Scroller extensions when coupled with Shiny, hence these specific functionalities are excluded.
- kable() is not up to date with the current version of R and was not used.
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
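The warning above can be addressed by rendering the table server-side, as the DT documentation suggests. A minimal sketch inside a Shiny app is shown below; the output ID is illustrative, and mtcars stands in for the Airbnb data.

```r
library(shiny)
library(DT)

ui <- fluidPage(
  DTOutput("listings_table")
)

server <- function(input, output, session) {
  # server = TRUE (the default for renderDT) keeps the data on the server
  # and sends only the current page to the browser, so large data frames
  # do not trigger the client-side size warning.
  output$listings_table <- renderDT({
    datatable(mtcars, options = list(pageLength = 10))
  }, server = TRUE)
}

# shinyApp(ui, server)  # not run here
```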